Abstract 

I survey some recent applications-oriented 
NL generation systems, and claim that de- 
spite very different theoretical backgrounds, 
these systems have a remarkably similar ar- 
chitecture in terms of the modules they di- 
vide the generation process into, the compu- 
tations these modules perform, and the way 
the modules interact with each other. I also 
compare this 'consensus architecture' among 
applied NLG systems with psycholinguistic 
knowledge about how humans speak, and ar- 
gue that at least some aspects of the con- 
sensus architecture seem to be in agreement 
with what is known about human language 
production, despite the fact that psycholin- 
guistic plausibility was not in general a goal 
of the developers of the surveyed systems. 



1 Introduction 

In this paper I survey some recently-developed NL gen- 
eration systems that (a) cover the complete generation 
process and (b) are designed to be used by application 
programs, as well as (or even instead of) making some 
theoretical point. I claim that despite their widely dif- 
fering theoretical backgrounds, the surveyed systems 
are similar in terms of the modules they divide the 
generation process into, the way the modules interact 
with each other, and (at least in some cases) the kinds 
of computations each individual module performs. In 
other words, despite different theoretical claims, there 
is a remarkable level of similarity in how these systems
'really work'; that is, a de facto 'consensus architecture'
seems to be emerging for how applied NLG
systems should generate text. The existence of such 
agreement among the surveyed systems is especially 
surprising because in some cases the theoretical back- 
grounds of the systems examined argue against some 
aspects of the consensus architecture. 

I also compare the consensus architecture to psy- 
cholinguistic knowledge about language generation in 
human speakers. Such a comparison is often diffi- 
cult to make, because of the many gaps in our cur- 
rent knowledge about how humans speak. Neverthe- 
less, I argue that as far as such a comparison can 
be made, the specific design decisions embodied in 
the consensus architecture seem to often be more or 
less in accord with current knowledge of human lan- 
guage generation. This is again perhaps somewhat 
surprising, since psycholinguistic plausibility was not 
in general a goal of the developers of the examined sys- 
tems. Perhaps (being very speculative) this indicates 
that there is some connection between the engineer- 
ing considerations that underlie the design decisions 
made in the consensus architecture, and the
maximize-performance-in-the-real-world criteria that drove the
evolutionary processes that created the human lan- 
guage processor. If (a big if!) there is some truth 
to this hypothesis, then studying the engineering is- 
sues involved in building applied systems may lead to 
insights about the way the human language system 
works. 

2 The Systems Surveyed 

The analysis presented here is based on a survey of 
generation systems that: 

1. Were written (or at least substantially extended)
since the late 1980s. This excludes early systems
such as Davey's PROTEUS or Jacobs's KING.

2. Are complete systems that start from an intention,
a query, or some data that needs to be communicated,
and produce actual sentences as output. [...] edge
system (discourse planning) or my own FN
(noun-phrase construction).

3. Were motivated, at least to some degree, by the
desire to interface to application programs. This
excludes systems that were primarily intended to
be computational explorations of a particular linguistic
theory, such as Patten's SLANG, or computational
models of observed linguistic behavior,
such as Hovy's PAULINE.

4. Are well enough known that I could easily obtain
information about them.

In short, the idea was to survey recent systems that
looked at the entire generation problem, and that were
motivated by applications and engineering considerations
as well as linguistic theory. The systems examined
were:¹

FUF [Elhadad, 1992]: Developed at Columbia University
and used in several projects there, including
COMET and ADVISOR II; I will use the term
'FUF' in this paper to refer to both FUF itself and
the various related systems at Columbia. Several
other universities have also recently begun to use
FUF in their research. FUF is based on Kay's functional
unification formalism [Kay, 1979].

IDAS [Reiter et al., 1992]: Developed at Edinburgh
University, IDAS was a prototype online documentation
system for users of complex machinery.
From a theoretical perspective, IDAS's main objective
was to show that a single representation and
reasoning system can be used for both domain and
linguistic knowledge [Reiter and Mellish, 1992].

JOYCE [Rambow and Korelsky, 1992]: Developed
at Odyssey Research Associates, JOYCE is taken
as a representative of several NL generation systems
produced by ORA and CoGenTex, including
GOSSIP, FOG, and LFS. These systems are
all aimed at commercial or government applications
(in JOYCE's case, producing summaries of
software designs), and are all based on Mel'cuk's
Meaning-Text theory [Mel'cuk, 1988].

PENMAN [Penman Natural Language Group, 1989]: Under
development at ISI since the early 1980s, PENMAN has been used
in several demonstration systems. As usual, I will
use 'PENMAN' to refer to both PENMAN itself and
the systems that were built around it. PENMAN's
theoretical basis is systemic linguistics [Halliday,
1985] and rhetorical-structure theory.

SPOKESMAN [Meteer, 1989]: SPOKESMAN was developed
at BBN for various applications, and has
some of the same design goals as McDonald's
MUMBLE system [McDonald, 1983], including in
particular the desire to build a system that at
least in some respects is psycholinguistically plausible.
SPOKESMAN uses Tree-Adjoining Grammars
[Joshi, 1987] for syntactic processing.

1 The selection rules are of course not completely well defined,
which means there was inevitably some arbitrariness
when I used them to select particular systems to include
in the survey. I encourage any reader who believes that I
have unfairly omitted a system to contact me, so that this
system can be included in future versions of the survey.



All of the examined systems produce English, and they 
also are mostly aimed at producing technical texts (in- 
stead of, say, novels or newspaper articles); it would 
be interesting to examine systems aimed at other lan- 
guages or other types of applications, and see if this 
caused any architectural differences. 

3 An Overview of the Consensus Architecture

As can be seen, the chosen systems have widely dif- 
ferent theoretical bases. It is therefore quite interest- 
ing that they all seem to have ended up with broadly 
similar architectures, in that they break up the gener- 
ation process into a similar set of modules, and they 
all use a pipeline architecture to connect the modules; 
i.e., the modules are linearly ordered, and information 
flows from each module to its successor in the pipeline, 
with no feedback from later modules to earlier mod- 
ules. The actual modules possessed by the systems 
(discussed in more detail in Section 4, as is the pipeline
architecture) are: 

Content Determination: This maps the initial in- 
put of the generation system (e.g., a query to be 
answered, or an intention to be satisfied) onto a 
semantic form, possibly annotated with rhetorical 
(e.g., RST) relations. 

Sentence Planning: Many names have been used 
for this process; here I use one suggested by Rambow
and Korelsky [1992]. The basic goal is to map
conceptual structures onto linguistic ones: this includes
generating referring expressions, choosing
content words and (abstract) grammatical relationships,
and grouping information into clauses
and sentences.

Surface Generation: I use this term in a fairly nar- 
row sense here, to mean a module that takes as in- 
put an abstract specification of information to be 
communicated by syntax and function words, and 
produces as output a surface form that commu- 
nicates this information (e.g., maps :speechact 
imperative into an English sentence that lacks 
a surface subject). All of the examined sys- 
tems had separate sentence-planning and surface- 
generation modules, and the various intermedi- 
ate forms used to pass information between these 
modules conveyed similar kinds of information. 

Morphology: Most of the systems have a fairly sim- 
ple morphological component, presumably since 
English morphology is quite simple. 

Formatting: IDAS, JOYCE, and PENMAN also contain
mechanisms for formatting (in the LaTeX sense)
their output, and/or adding hypertext annota- 
tions to enable users to click on portions of the 
generated text. 

4 A More Detailed Examination of the Architecture

This section describes the consensus architecture in 
more detail, with particular emphasis on some of the 
design decisions embodied in it that more theoreti- 
cally motivated researchers have disagreed with. It 
furthermore examines the plausibility of these deci- 
sions from a psycholinguistic perspective, and argues 
that in many respects they agree with what is known 
about how humans generate text. 

4.1 Modularized Pipeline Architecture 

The consensus architecture divides the generation pro- 
cess into multiple modules, with information flowing in 
a 'pipeline' fashion from one module to the next. By 
pipeline, I mean that the modules are arranged in a lin- 
ear order, and each module receives information only 
from its predecessor (and the various linguistic and domain
knowledge bases), and sends information only to
its successor. Information does not flow 'backwards' 
from a module to its predecessor, and global 'black- 
boards' that all modules can access and modify are 
not used. I do not mean by 'pipeline' that generation 
must be incremental in the sense that, say, syntactic 
processing of the first sentence is done at the same time 
as semantic processing of the second; I believe most of 
the systems examined could in fact do this, but they 
have not bothered to do so (probably because it would 
not be of much benefit to the applications programs of 
interest). 
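To make this definition concrete, here is a minimal Python sketch of a one-way pipeline. It is an illustration under stated assumptions, not code from any surveyed system: the stage functions, their toy inputs and outputs, and the names run_pipeline and PIPELINE are all hypothetical. The property being illustrated is that run_pipeline hands each stage's output only to its successor, with no backward calls and no shared blackboard.

```python
from typing import Any, Callable, List

# A stage sees only its predecessor's output (in a real system it would
# also have read-only access to linguistic and domain knowledge bases).
Stage = Callable[[Any], Any]


def run_pipeline(stages: List[Stage], initial_input: Any) -> Any:
    """Run stages in a fixed linear order with one-way information flow."""
    data = initial_input
    for stage in stages:
        data = stage(data)  # no stage can call back into an earlier one
    return data


# Hypothetical domain knowledge: answers stored as domain concepts, not words.
KB = {"Who is Mary?": {"concept": "Sibling", "sex": "Female", "of": "John"}}


def content_determination(query: str) -> dict:
    # Map the input query onto a (toy) semantic form.
    return KB[query]


def sentence_planning(sem: dict) -> dict:
    # Choose a content word and a grammatical relation (possessive).
    pair = (sem["concept"], sem["sex"])
    head = "sister" if pair == ("Sibling", "Female") else "sibling"
    return {"head": head, "possessor": sem["of"]}


def surface_generation(dsyn: dict) -> List[str]:
    # English expresses the possessive with word order plus "'s".
    return [dsyn["possessor"], "'s", dsyn["head"]]


def morphology_and_formatting(tokens: List[str]) -> str:
    # Attach the clitic "'s" to the preceding word and capitalize.
    text = " ".join(tokens).replace(" 's", "'s")
    return text[0].upper() + text[1:]


PIPELINE = [content_determination, sentence_planning,
            surface_generation, morphology_and_formatting]

print(run_pipeline(PIPELINE, "Who is Mary?"))  # -> John's sister
```

Swapping or improving one stage only requires that it keep the same input and output format as its neighbours, which is the engineering benefit the following sections discuss.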

4.1.1 Design decision: avoid integrated architecture

Many NL generation researchers have argued 
against dividing the generation process into modules; 
perhaps the best-known are Appelt [1985] and Danlos
[1984]. Others, such as Rubinoff [1992], have accepted
modules but have argued that the architecture
must allow feedback between later modules and earlier 
modules, which argues against the one-way informa- 
tion flow of the pipeline architecture. 

The argument against pipelines and modules is al- 
most always some variant of 'there are linguistic phe- 
nomena that can only be properly handled by looking 
at constraints from different levels (intentional, seman- 
tic, syntactic, morphological), and this is difficult to do 
in a pipeline system.' To take one fairly random exam- 
ple, Danlos and Namer [1988] have pointed out that 
since the French masculine and feminine pronouns le
and la are abbreviated to l' before a word that starts
with a vowel, and since in some cases le and la may
be unambiguous references while l' is not, the
referring-expression system must have some knowledge of
surface word order and selected content and function 
words before it can decide whether a pronoun is ac- 
ceptable; this will not be possible if referring expres- 
sions are chosen before syntactic structures are built, 
as happens in the consensus architecture. 
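The dependency that Danlos and Namer point to can be seen in a small sketch. It assumes a toy elision rule (only an initial vowel letter triggers elision; words beginning with h and other real French subtleties are ignored), and the function names are hypothetical. The point is that pronoun_is_unambiguous needs next_word, which in the consensus architecture is only known after surface generation has chosen words and word order.

```python
from typing import List

VOWELS = set("aeiouéèêëàâîïôûù")


def realize_object_clitic(gender: str, next_word: str) -> str:
    """Return the French object clitic for a referent of this gender.

    Toy rule: le/la become l' before a word starting with a vowel letter.
    """
    if next_word and next_word[0].lower() in VOWELS:
        return "l'"  # the gender distinction disappears here
    return "le" if gender == "masc" else "la"


def pronoun_is_unambiguous(candidate_genders: List[str], gender: str,
                           next_word: str) -> bool:
    """True if the clitic's surface form fits exactly one candidate referent.

    candidate_genders lists the genders of all referents the pronoun could
    plausibly pick out, including the intended one.
    """
    form = realize_object_clitic(gender, next_word)
    matches = [g for g in candidate_genders
               if realize_object_clitic(g, next_word) == form]
    return len(matches) == 1


# One masculine and one feminine referent are both in focus.
print(pronoun_is_unambiguous(["masc", "fem"], "fem", "voit"))  # la voit -> True
print(pronoun_is_unambiguous(["masc", "fem"], "fem", "aime"))  # l'aime -> False
```

A pipelined referring-expression module would have to make this decision without knowing next_word, which is exactly the objection; the pragmatic response discussed below is to ignore the elision case or avoid pronouns where disambiguation is critical.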

There is undoubtedly some truth to these arguments,
but the applications builder also has to consider
the engineering reality that the sorts of systems
proposed by Appelt, Danlos, and Namer are extremely 
difficult to build from an engineering perspective. The 
engineering argument for modularization is particularly
strong; Marr has put this very well in [Marr, 1976,
page 485]:

Any large computation should be split up and 
implemented as a collection of small subparts 
that are as nearly independent of one another 
as the overall task allows. If a process is not 
designed in this way a small change in one 
place will have consequences in many other 
places. This means that the process as a 
whole becomes extremely difficult to debug 
or improve, whether by a human designer or 
in the course of natural evolution, because a 
small change to improve one part has to be
accompanied by many simultaneous compensatory
changes elsewhere.

Marr argues that a modularized structure makes sense 
both for human engineers and for the evolutionary pro- 
cess that produced the human brain. The evidence is 
indeed strong that the human brain is highly modu- 
larized. This evidence comes from many sources (e.g., 
cognitive experiments and PET scans of brain activ- 
ity), but I think perhaps the most convincing evidence 
is from studies of humans with brain damage. Such 
people tend to lose specific abilities, not suffer overall 
degradation that applies equally to all abilities. Ellis
and Young [1988] provide an excellent summary of
such work, and list patients who, for example:

• can produce syntactically correct utterances but 
can not organize utterances into coherent wholes, 
i.e., can perform surface generation but not con- 
tent determination. 

• can generate word streams that tell a narrative 
but are not organized into sentences, i.e., can per- 
form content determination but not surface gen- 
eration. 

• can produce coherent texts organized in grammat- 
ical structures, but have a severely restricted vo- 
cabulary; i.e., have impaired lexical choice (these 
patients still have conceptual knowledge, they just 
have problems lexicalizing it). 

The main engineering argument for arranging mod- 
ules into a pipeline instead of a more complex structure 
is again simplicity and ease of debugging. In a one-way 
pipeline of N modules there are only N-1 interfaces
between modules, while a pipeline with 'two-way'
information flow has 2(N-1) interfaces, and a system that
fully connects each module with every other module
will have N(N-1) interfaces (counting each direction of
communication separately). A system that has a two-way
interface between every possible pair of modules
will undoubtedly be able to handle many linguistic
phenomena in a more powerful, elegant, principled,
etc., manner than a system that arranges modules in a
simple one-way pipeline; such a system will also, how- 
ever, be much more difficult to build and (especially) 
debug. 
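The interface counts above can be checked with a few lines of Python. The helper is hypothetical, and it assumes, as the 2(N-1) figure implies, that each direction of communication between two modules counts as a separate interface.

```python
def interface_counts(n: int) -> dict:
    """Count directed inter-module interfaces for n modules (n >= 1)."""
    return {
        # each module talks only to its successor
        "one_way_pipeline": n - 1,
        # adds a backward channel between each adjacent pair
        "two_way_pipeline": 2 * (n - 1),
        # a channel in each direction between every pair of modules
        "fully_connected": n * (n - 1),
    }


for n in (3, 5, 8):
    print(n, interface_counts(n))
```

For a five-module pipeline like the one in Section 3, this gives 4, 8, and 20 interfaces respectively; the fully connected count grows quadratically in N while the pipeline counts grow linearly.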

It is easy to argue that a one-way pipeline is worse 
at handling some linguistic phenomena than a richly- 
connected architecture, but this is not the end of the 
story for the system-building engineer; he or she has 
to balance the cost of the pipeline being inefficient 
and/or inelegant at handling some phenomena against 
the benefit of the pipeline being a much easier structure
to build and debug. We have insufficient engineering
data at present to make any well-substantiated
claims about whether the one-way pipeline has the optimal
cost/benefit tradeoff or not (and in any case this
will probably depend somewhat on the circumstances
of each application [Reiter and Mellish, 1993]), but
the circumstantial evidence on this question is striking:
despite the fact that so many theoretical papers have 
argued against pipelines and very few (if any) have 
argued for pipelines, every one of the applications- 
oriented systems examined in this survey chose to use 
the one-way pipeline architecture. 

In other words, an applications systems builder can 
not look at particular linguistic phenomena in isola- 
tion; he or she must weigh the benefits of 'properly' 
handling these phenomena against the cost of imple- 
menting the proposed architecture. In the French pro- 
noun case described by Danlos and Namer, for exam- 
ple, the applications builder might argue that in the 
great majority of cases no harm will in fact be done if 
the referring-expression generator simply ignores the 
possibility that pronouns may be abbreviated to l',
especially given humans' ability to use context to disambiguate
references; and if a situation does arise where
it is absolutely essential that the human reader be able 
to correctly disambiguate a reference, then perhaps 
pronouns should not be used in any case. Given this, 
and the very high engineering cost of building an in- 
tegrated architecture of the sort proposed by Danlos 
and Namer, is implementing such an architecture truly 
the most effective way of using scarce engineering re- 
sources? 

Psycholinguistic research on self-monitoring and 
self-repair (summarized in [Levelt, 1989, pages
458-499]) suggests that there is some feedback in the
human language generation system, so the human language
processor is probably more complex than a simple
one-way pipeline; but it may not be much more
complex. To the best of my knowledge, most of the 
observed self-repair phenomena could be explained by 
an architecture that added a few feedback loops from 
later stages of the pipeline back to the initial planner; 
this would only slightly add to the number of inter-module
interfaces (perhaps N+1 instead of N-1, say),
and hence would have a much lower engineering cost 
than implementing the fully connected 'every module 
communicates with every other module' architecture. 
Whether the human language engine is organized as a 
'pipeline plus a few feedback loops' or an 'every module 
talks to every other module' architecture is unknown 
at this point; hopefully new psycholinguistic experi- 
ments will shed more light on this issue. I think it 
would be very interesting, for example, to test human 
French speakers on situations of the sort described by 
Danlos and Namer, and see what they actually did in 
such contexts; I do not believe that such an experiment 
has (to date) been performed. 

4.2 Content Determination 

Content determination takes the initial input to the 
generation system, which may be, for example, a query 
to be answered or an intention to be satisfied, and pro- 
duces from it a 'semantic form', 'conceptual represen- 
tation', or 'list of propositions', i.e., a specification of 
the meaning content of the output text. I will in this 
paper use the term semantic representation for this 
meaning specification. Roughly speaking, the semantic
representations used by all of the examined systems
can be characterized as some kind of 'semantic
net' (using the term in its broadest sense, as in [Sowa,
1991]) where the primitive elements in the net are conceptual
instead of linguistic (e.g., domain KB concepts
instead of English words). In some cases the seman- 
tic nets also include discourse and rhetorical relations 
between portions of the net; subsequent portions of 
the generator use these to generate discourse connec- 
tives (e.g., However), control formatting (e.g., the use 
of bulletized lists), etc. 
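A minimal sketch of such a semantic representation follows. The class, the concept names (e.g. Has-High-Cost), and the connective table are hypothetical, and the surveyed systems use much richer knowledge-representation formalisms; the sketch only shows conceptual rather than lexical primitives, plus an optional rhetorical relation that a later module could map onto a connective such as However.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class SemanticNet:
    """Toy semantic representation whose primitives are domain concepts."""
    # node id -> domain-KB concept (a concept name, not an English word)
    nodes: Dict[str, str] = field(default_factory=dict)
    # conceptual links: (source node, relation, target node)
    links: List[Tuple[str, str, str]] = field(default_factory=list)
    # optional rhetorical relations: (relation, first prop, second prop)
    rhetorical: List[Tuple[str, str, str]] = field(default_factory=list)

    def rhetorical_relation(self, first: str, second: str) -> Optional[str]:
        """Return the rhetorical relation linking two propositions, if any."""
        for relation, a, b in self.rhetorical:
            if (a, b) == (first, second):
                return relation
        return None


# Hypothetical table a later module might use to pick discourse connectives.
CONNECTIVES = {"Contrast": "However", "Sequence": "Then"}

net = SemanticNet(
    nodes={"x": "Pump-Model-A", "p1": "Has-High-Cost",
           "p2": "Has-High-Reliability"},
    links=[("p1", "Theme", "x"), ("p2", "Theme", "x")],
    rhetorical=[("Contrast", "p1", "p2")],
)
relation = net.rhetorical_relation("p1", "p2")
print(CONNECTIVES.get(relation) if relation else None)  # -> However
```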

The systems examined use quite different
content-determination mechanisms (i.e., there was no
consensus); schemas [McKeown, 1985] were the most popular
approach. 
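Schema-based content determination can be sketched roughly as follows. This is a heavily simplified, hypothetical illustration: the schema is just an ordered list of rhetorical predicate types (named after, but not reproducing, McKeown's predicates), and the knowledge base is a dictionary keyed by predicate type and entity. Filling the slots in schema order means that choosing content and ordering it rhetorically happen in a single pass.

```python
from typing import Dict, List, Tuple

Proposition = Tuple[str, str, str]  # (predicate type, entity, content)

# Hypothetical, much-simplified schema: ordered rhetorical predicate types.
SCHEMA = ["identification", "attributive", "particular-illustration"]


def fill_schema(schema: List[str], entity: str,
                kb: Dict[Tuple[str, str], str]) -> List[Proposition]:
    """Instantiate each schema slot from the KB, skipping slots with no data.

    The output is already in rhetorical order, so no separate
    ordering stage is needed afterwards.
    """
    plan: List[Proposition] = []
    for predicate in schema:
        fact = kb.get((predicate, entity))
        if fact is not None:
            plan.append((predicate, entity, fact))
    return plan


KB = {
    ("identification", "pump"): "a device that moves fluid",
    ("particular-illustration", "pump"): "the coolant pump in unit 3",
}
for prop in fill_schema(SCHEMA, "pump", KB):
    print(prop)
```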

4.2.1 Design decision: integrated content determination and rhetorical planning

Content determination in the systems examined ba- 
sically performs two functions: 

Deep content determination: Determine what in- 
formation should be communicated to the hearer. 

Rhetorical planning: Organize this information in 
a rhetorically coherent manner. 



Hovy [1988] has proposed an architecture where 
these tasks are performed separately (in particular, the 
application program performs deep content determina- 
tion, while the generation system performs rhetorical 
planning). Among the systems examined, however, 
Hovy is unique in taking this approach; the builders of 
the other systems (including Moore and Paris [1989],
who also worked with PENMAN) apparently believe
that these two processes are so closely related that 
they should be performed simultaneously. 

I am not aware of any psychological data that di- 
rectly address this issue. However, Hovy's architec- 
ture requires the language-producing agent to com- 
pletely determine the content of a paragraph before 
he/she/it can begin to utter it (since the rhetorical 
planner determines what the first sentence is, and it 
is not called until deep content determination is com- 
pleted), and intuitively it seems implausible to me that 
human speakers do this; it also goes against incremental
theories of human speech production [Levelt, 1989,
pages 24-27]. 
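The ordering argument can be made concrete by logging when content is decided and when it is uttered. The function names and the event log are hypothetical; the sketch only illustrates the reasoning above, not how Hovy's system or any surveyed system is implemented. In the separated design every decision precedes the first utterance, while in the integrated design (with candidates already in rhetorical order, e.g. from a schema) each proposition can be uttered as soon as it is chosen.

```python
from typing import Callable, Iterable, Iterator, List


def separated(candidates: Iterable[str], relevant: Callable[[str], bool],
              order: Callable[[List[str]], List[str]],
              utter: Callable[[str], None]) -> None:
    """Deep content determination finishes before rhetorical planning starts."""
    content = [c for c in candidates if relevant(c)]  # choose everything first
    for prop in order(content):                       # then order it
        utter(prop)


def integrated(candidates: Iterable[str],
               relevant: Callable[[str], bool]) -> Iterator[str]:
    """Candidates arrive in rhetorical order; each is yielded once chosen."""
    for prop in candidates:
        if relevant(prop):
            yield prop


log: List[str] = []


def relevant(p: str) -> bool:
    log.append("decide " + p)
    return True


def utter(p: str) -> None:
    log.append("say " + p)


separated(["p1", "p2"], relevant, lambda c: c, utter)
print(log)  # -> ['decide p1', 'decide p2', 'say p1', 'say p2']
log.clear()
for p in integrated(["p1", "p2"], relevant):
    utter(p)
print(log)  # -> ['decide p1', 'say p1', 'decide p2', 'say p2']
```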

4.3 Sentence Planning 

The sentence planner converts the semantic represen- 
tation, which is specified in terms of domain entities, 
into an abstract linguistic representation that speci- 
fies content words and grammatical relationships. I 
will use Mel'cuk's term deep syntactic form for this 
representation. 

All of the systems analyzed possess a deep syntac- 
tic representation; none attempt to go from semantics 
to surface form in a single step. IDAS and PENMAN 
use variants of the same deep syntactic language, SPL 
[Kasper, 1989]. FUF and JOYCE use deep syntactic
languages that are based (respectively) on functional
unification and meaning-text theory, but these convey
much the same information as SPL. SPOKESMAN
uses the realization specification language of MUMBLE
[McDonald, 1983] as its deep syntactic representation;
I have found it difficult to compare this language to 
the others, but McDonald (personal communication) 
agrees that it conveys essentially the same information 
as SPL. 

Unfortunately, while all of the systems possessed a 
module which converted semantic representations into 
deep syntactic ones, each system used a different name 
for this module. In FUF it is the 'lexical chooser', in
IDAS it is the 'text planner', in JOYCE it is the 'sentence
planner', in SPOKESMAN it is the 'text structurer', and
in PENMAN it does not seem to have a name at all;
e.g., Hovy [1988] simply refers to 'pre-generation
text-planning tasks'. I use the JOYCE term here because I
think it is the least ambiguous. 

The specific tasks performed by the sentence planner 
include: 

1. Mapping domain concepts and relations into con- 
tent words and grammatical relations. 

2. Generating referring expressions for individual do- 
main entities. 

3. Grouping propositions into clauses and sentences. 

Relatively little is said in the papers about clause 
grouping and referring-expression generation, but 
more information is available on the first task, mapping
domain entities onto linguistic entities. All the
examined systems except perhaps PENMAN use a variant
of what I have elsewhere called the
'structure-mapping' approach [Reiter, 1991];² I do not know
what approach PENMAN uses (the papers are not clear
on this). Structure-mapping is based on a dictionary
that lists the semantic-net equivalents of linguistic
resources [Meteer, 1991] such as content words and
grammatical relationships. This dictionary might, for
example, indicate that the English word sister is equiv- 
alent (in the domain knowledge-base of interest) to the 
structure Sibling with attribute Sex:Female, and that 
the domain relation Part-of can be expressed with the 
grammatical possessive, e.g., the car's engine. Given 
this dictionary, the structure-mapping algorithm iter- 
atively replaces semantic structures by linguistic ones, 
until the entire semantic net has been recoded into a 
linguistic structure. There may be several ways of re- 
coding a semantic representation into a linguistic one, 
which means structure-mapping systems have a choice 
between using the first acceptable reduction they find, 
or doing a search for a reduction that maximizes some 
optimality criterion (e.g., fewest number of words). 
The papers I read were not very clear on this issue, 
but I believe that while most of the systems surveyed 
use the first acceptable reduction found, FUF in some
cases searches for an optimal reduction. 
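The structure-mapping idea, and the difference between using the first acceptable reduction and searching for an optimal one, can be sketched as follows. The dictionary entries (built around the sister example above), the encoding of a semantic net as a set of attribute-value facts, and all function names are simplifying assumptions; the relation case (Part-of as a possessive) is left out. Here the "first" reduction is simply the first one a depth-first enumeration produces, and "optimal" means fewest words, the example criterion mentioned in the text.

```python
from typing import FrozenSet, Iterator, List, Optional, Tuple

Fact = Tuple[str, str]                     # (attribute, value) of one entity
Entry = Tuple[FrozenSet[Fact], List[str]]  # (semantic pattern, content words)

# Hypothetical dictionary: semantic-net equivalents of content words.
DICTIONARY: List[Entry] = [
    (frozenset({("Sex", "Female")}), ["female"]),
    (frozenset({("type", "Sibling")}), ["sibling"]),
    (frozenset({("type", "Sibling"), ("Sex", "Female")}), ["sister"]),
]


def recodings(facts: FrozenSet[Fact]) -> Iterator[List[str]]:
    """Yield every way of recoding `facts` entirely into dictionary words.

    Each step replaces one dictionary pattern found in the remaining
    semantic material by its words, until nothing semantic is left.
    """
    if not facts:
        yield []
        return
    for pattern, words in DICTIONARY:
        if pattern <= facts:
            for rest in recodings(facts - pattern):
                yield words + rest


def first_reduction(facts: FrozenSet[Fact]) -> Optional[List[str]]:
    """Use the first complete recoding found (no optimization)."""
    return next(recodings(facts), None)


def optimal_reduction(facts: FrozenSet[Fact]) -> Optional[List[str]]:
    """Search all recodings for the one with the fewest words."""
    return min(recodings(facts), key=len, default=None)


SISTER = frozenset({("type", "Sibling"), ("Sex", "Female")})
print(first_reduction(SISTER))    # -> ['female', 'sibling']
print(optimal_reduction(SISTER))  # -> ['sister']
```

With this entry order the first reduction found is 'female sibling', while the search finds the one-word 'sister'; the search, however, enumerates every recoding, which is where its extra cost comes from.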

4.3.1 Design decision: separation of lexical choice from surface realization

The consensus architecture clearly separates lexical 
choice of content words (done during sentence plan- 
ning) from syntactic processing (performed during sur- 
face generation). In other words, it does not use an 
integrated 'lexicogrammar', which systemic theorists 
in particular (e.g., [Matthiessen, 1991]) have argued
for, and which is implicit in some unification-based
approaches, such as the semantic head-driven algorithm
[Shieber et al., 1990].

Despite these theoretical arguments, none of the
systems examined used an integrated lexicogrammar,
including unification-based FUF and systemic-based
PENMAN.³ In contrast, earlier unification-based systems,
such as the tactical component of McKeown's
TEXT system [McKeown, 1985], did integrate lexical
and syntactic processing in a single 'tactical generator';
also, systemic systems that have been less driven
by application needs than PENMAN, such as GENESYS
[Fawcett and Tucker, 1990], have used integrated
lexicogrammars.

2 Even though I have previously argued against
structure-mapping because it does not do a good job of
handling lexical preferences [Reiter, 1991], I nevertheless
ended up using this technique when I moved from my Ph.D.
research to the more applications-oriented IDAS project.
Perhaps this is another example of engineering considerations
overriding theoretical arguments.

3 The PENMAN papers do not explicitly say where lexical
choice is performed. However, all examples of PENMAN
SPL input that I have seen have essentially had content
words already specified, which suggests that lexical choice
is performed before syntactic processing in PENMAN.

There is psychological evidence that at least some 
lexical processing is separated from syntactic processing,
e.g., the patient mentioned in Section 4.1.1 who
was able to perform content-determination and syn- 
tactic generation but had a very restricted speaking 
vocabulary. I think it's also very suggestive that hu- 
mans have different learning patterns for content and 
function words; the former are 'open-class' and eas- 
ily learned, while the latter are 'closed-class' and peo- 
ple tend to stick to the ones they learned as children. 
There is less evidence on the location of lexical choice 
in the psycholinguistic pipeline, and on whether it is 
performed in one stage or distributed among several 
stages. 

4.4 Surface Generation 

Surface generation has been used to mean many dif- 
ferent things in the literature. I use it here to 
refer to the portion of the generation system that 
knows how grammatical relationships are actually ex- 
pressed in English (or whatever the target language 
is). For example, it is the surface generator that 
knows what function words and word order relation- 
ships are used in English for imperative, interroga- 
tive, and negated sentences; it is the surface gener- 
ator that knows which auxiliaries are required for the 
various English tenses; and it is the surface generator 
that knows when pronominalization is syntactically re- 
quired (John scolded himself, not John scolded John). 
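The kind of knowledge that belongs to the surface generator can be illustrated with a toy realizer for two of the cases just mentioned: imperatives without a surface subject, and syntactically required reflexive pronouns. The input keys (speechact, subject, subject_gender, verb_base, verb_past, object) are a hypothetical, much-reduced stand-in for an SPL-like deep syntactic form, and coreference is approximated by comparing strings. Content words are assumed to have been chosen upstream, during sentence planning, as in the consensus architecture.

```python
from typing import Dict

REFLEXIVE = {"masc": "himself", "fem": "herself", "neut": "itself"}


def realize_clause(spec: Dict[str, str]) -> str:
    """Realize a toy deep-syntactic spec as an English clause.

    Handles two surface-level facts: imperatives have no surface subject,
    and an object coreferent with the subject must be a reflexive pronoun.
    """
    imperative = spec.get("speechact") == "imperative"
    if imperative:
        words = [spec["verb_base"]]  # no surface subject
    else:
        words = [spec["subject"], spec["verb_past"]]
    obj = spec.get("object")
    if obj is not None:
        if not imperative and obj == spec["subject"]:
            # Required by syntax: "John scolded himself", not "John scolded John".
            words.append(REFLEXIVE[spec["subject_gender"]])
        else:
            words.append(obj)
    sentence = " ".join(words)
    return sentence[0].upper() + sentence[1:] + "."


print(realize_clause({"speechact": "assertion", "subject": "John",
                      "subject_gender": "masc", "verb_base": "scold",
                      "verb_past": "scolded", "object": "John"}))
print(realize_clause({"speechact": "imperative", "verb_base": "remove",
                      "object": "the cover"}))
```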

4.4.1 Design decision: top-down algorithm with (almost?) no backtracking

The grammars and grammar representations used 
by the systems examined are quite different, but all 
systems process the grammars with a top-down algo- 
rithm that uses minimal, if any, backtracking. None 
of the systems use the semantic head-driven generation
algorithm [Shieber et al., 1990], although this is
probably the single best-known algorithm for surface
generation; Elhadad [1992, chapter 4] claims that such
an algorithm is only necessary for systems that at- 
tempt to simultaneously perform both lexical choice 
and surface generation, which none of the examined 
systems do. Perhaps more interestingly, four of the 
five systems do not allow backtracking, and the fifth, 
FUF, allows backtracking but does not seem to use it
much (if at all) during surface generation (backtracking
is used in FUF during sentence planning). This is
interesting, since backtracking is usually regarded as 
an essential component of unification-based generation 
approaches; it is certainly used in the semantic head-driven
algorithm, and in the TEXT generator [McKeown, 1985].



From a psycholinguistic perspective, many people 
have argued that human language production is incremental
(see the summary in [Levelt, 1989, pages
24-27]), which means that of necessity it cannot include
much backtracking. The garden-path phenomenon
shows that there are limits to how much syntactic
backtracking people perform during language
understanding. This evidence is of course suggestive
rather than definitive; it seems likely that there are
limitations on how much (if any) backtracking humans
will perform during syntactic processing (see also the
arguments in [McDonald, 1983]), but there is no hard
proof of this (as far as I am aware). 
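A top-down realizer that commits to the first applicable grammar alternative and never backtracks can be sketched as follows. The grammar format (ordered alternatives guarded by tests on an input feature dictionary) and all names are hypothetical and far simpler than the grammars of the surveyed systems; the sketch only shows the control regime.

```python
from typing import Callable, Dict, List, Tuple

Features = Dict[str, str]
Alternative = Tuple[Callable[[Features], bool], List[str]]

# Hypothetical grammar: each category has ordered alternatives guarded by a
# test on the input features. In an expansion, "=name" copies a feature
# value, a capitalized item is a category to expand, and anything else is a
# literal function word.
GRAMMAR: Dict[str, List[Alternative]] = {
    "S": [
        (lambda f: f.get("speechact") == "imperative", ["VP"]),
        (lambda f: f.get("speechact") == "question",
         ["does", "=subject", "VP"]),
        (lambda f: True, ["=subject", "VP"]),
    ],
    "VP": [
        (lambda f: f.get("speechact", "assertion") == "assertion",
         ["=verb_3sg", "=object"]),
        (lambda f: True, ["=verb", "=object"]),
    ],
}


def realize_top_down(category: str, f: Features) -> List[str]:
    """Expand `category` top-down, committing to the first matching alternative."""
    for test, expansion in GRAMMAR[category]:
        if test(f):
            words: List[str] = []
            for item in expansion:
                if item.startswith("="):
                    words.append(f[item[1:]])
                elif item[0].isupper():
                    words.extend(realize_top_down(item, f))
                else:
                    words.append(item)
            return words  # no backtracking: later alternatives are never tried
    raise ValueError(f"no applicable alternative for {category}")


spec = {"subject": "John", "verb": "see", "verb_3sg": "sees", "object": "Mary"}
for act in ("assertion", "question", "imperative"):
    print(act, " ".join(realize_top_down("S", {**spec, "speechact": act})))
```

The realizer makes one committed pass over the grammar with no search; the price is that an early choice that later turns out to be poor cannot be undone.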

4.5 Morphology and Formatting 

These modules will not be further examined here, 
mainly because little information is given in the pa- 
pers on the details of how morphology and formatting 
are implemented. 

5 A Controversial (?) View 

I would like to conclude with a perhaps controver- 
sial personal opinion. There have been many cases 
where NL generation researchers (including myself) 
have claimed that a certain linguistic phenomenon is
best handled by a certain architecture. Even if this is
true, however, if it turns out that adopting this architecture
will substantially complicate the design of the
overall generation system, and that the most common
cases of the phenomenon of interest can be adequately
handled by adding a few heuristics to the appropriate 
stage of a simpler architecture, then the engineering- 
oriented NL worker must ask him- or herself if the 
benefits of the proposed architecture truly outweigh 
its costs. For instance, one cannot simply argue that 
an integrated architecture is superior to a pipeline be- 
cause it is better suited to handling certain kinds of 
pronominalization; it is also necessary to evaluate the 
engineering cost of shifting to an integrated architec- 
ture, and determine if, for example, better overall per- 
formance for the amount of engineering resources avail- 
able could be obtained by keeping the general pipeline 
architecture, and instead investing some of the engi- 
neering resources 'saved' by this decision into building 
more sophisticated heuristics into the pronominalization
module.

In doing so, I believe (and again this is a personal 
belief that probably cannot be substantiated by the ex- 
isting evidence) that the NL engineer is coming close 
to the 'reasoning' of the evolutionary process that cre- 
ated the human language system. Evolution does not 
care about elegant declarative formalisms or 'proper' 
(as opposed to 'hacky') handling of special cases; evo- 
lution's goal is to maximize performance in real-world 
situations, while maintaining an architecture that can 
be easily tinkered with by future evolutionary processes.
In short, evolution is an engineer, not a mathematician.⁴
It is thus perhaps not surprising if NL
generation systems designed to be used in real-world
applications end up with an architecture that seems to
bear some resemblance to the architecture of the human
language processor,⁵ and future attempts to build
applications-oriented generation systems may end up
giving us real insights into how language processing
works in humans, even if this is not the main purpose
of these systems. Similarly, psycholinguistic knowledge
of how the human language generator works may
suggest useful algorithms for NL engineers; one such
case is described in [Reiter and Dale, 1992].



Cross-fertilization between psycholinguistics and NL 
engineering will only arise, however, if the results of en- 
gineering analyses are reported in the research litera- 
ture, especially when they suggest going against some 
theoretical principle. Unfortunately, to date the re- 
sults of such analyses have all-too-often been regarded 
more as embarrassments (since they contradict the- 
ory) than as valuable observations, and hence have 
not been published. I would like to conclude this paper
by encouraging generation researchers to regard
the results of engineering analyses as being as interesting
and as important to the understanding of language as
conventional linguistic analyses. After all, as Woods
[1975] has pointed out, while descriptive analyses of
language can at best tell us what the brain does, engi- 
neering analyses can potentially offer insights on why 
the brain functions as it does. 



4 Gould's various popular books on evolutionary biology,
such as [Gould, 1983], give an excellent feel for evolution as
an engineer-cum-hacker; see also the interesting discussion
of language and evolution in [Pinker, 1994].



5 Of course, the best way to do something on a machine
is often not the best way to do it in nature; e.g., birds and
airplanes use different mechanisms to fly. On the other
hand, there does seem to be a remarkable congruence between
effective vision processing strategies in animals and
computers [Marr, 1982]. One could also argue that since
language (unlike flying) is purely a product of the human 
mind, any effective language processor is probably going to 
have to share some of the mind's processing strategies. 